Check out the first post of the series here, which covers the theory and foundations necessary to understand what’s going on in this second post.
This is a true story of how I lost money using machine learning (ML) to bet on CS:GO. The original idea and implementation came from a friend, who gave me permission to share this story in public.
In this post, I will go over the actual implementation of the solution:
- CS:GO
- Data scraping
- Feature engineering
- TrueSkill (with a side note on inferential vs predictive models)
- Modelling
- Evaluation
- Backtesting
- Why I lost 1000 euros
Solution
CS:GO
Counter-Strike: Global Offensive (CS:GO) is a first-person shooter (FPS) multiplayer game. It can be played casually or competitively. When played competitively, it typically follows this format:
- Two teams of 5 play against each other: Terrorists vs Counter-Terrorists
- Best of 3 maps (sometimes best of 1 or 5)
- Maps are played up to 30 rounds
- Each round can be won by killing the other team or by planting or defusing the bomb
- Each player has a number of kills (K), deaths (D), assists (A) and average damage per round (ADR)
If you don’t know much about videogames, don’t worry, you can treat CS:GO as any other team sport.

Web scraping
Data is the new oil.
As I explained in the first post, one of the reasons we chose CS:GO was data availability. Since we broke some terms and conditions, I won’t name our exact sources, but they were easily found online.
We collected both match data and betting odds. Note that match data is easy to find retroactively, but betting odds need to be collected in real-time, which limited our ability to run backtests (more on this later).
We collected 3 years’ worth of match data, covering over 30k matches. We managed to collect only 3 months of betting odds data, covering 1725 matches, with approximately 30 odds per match.
Match data contained information such as the teams playing, team composition, kills and deaths for each player, rounds won, the map to be played, and the final score (win-loss-tie).
To scrape the data that we needed we used Selenium through a headless browser. Then, we parsed the resulting HTML with BeautifulSoup.
Feature engineering
Past behavior is the best predictor of future behavior.
With the match data, we created hundreds of features1. Most features were related to past performance, such as the percentage of times team 1 won on the map to be played or against team 2. If the teams have faced each other before, who won back then is an important predictor now. We also used game score features like KD difference and ADR on a team and individual basis.
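To make the "past performance" idea concrete, here is a minimal sketch of a map win-rate feature that only looks at earlier matches. It is not the code we used: the dictionary keys (`team1`, `team2`, `map`, `winner`) and the function name are hypothetical, and the matches are assumed to be sorted by date.

```python
from collections import defaultdict

def add_past_win_rates(matches):
    """Annotate each match with team1's past win rate on the map to be played.

    `matches` is a list of dicts sorted by date; the keys used here
    ("team1", "team2", "map", "winner") are hypothetical names.
    """
    wins = defaultdict(int)    # (team, map) -> wins so far
    played = defaultdict(int)  # (team, map) -> matches so far
    rows = []
    for m in matches:
        key1 = (m["team1"], m["map"])
        # The feature is computed BEFORE the current result is recorded (no leakage).
        rate = wins[key1] / played[key1] if played[key1] else None
        rows.append({**m, "team1_map_win_rate": rate})
        for side in ("team1", "team2"):
            key = (m[side], m["map"])
            played[key] += 1
            wins[key] += int(m["winner"] == side)
    return rows

# Made-up matches, oldest first.
matches = [
    {"team1": "A", "team2": "B", "map": "dust2", "winner": "team1"},
    {"team1": "A", "team2": "C", "map": "dust2", "winner": "team2"},
    {"team1": "A", "team2": "B", "map": "dust2", "winner": "team1"},
]
features = add_past_win_rates(matches)
```

The important design choice is the order of operations: the feature is read first and the counters are updated afterwards, so a match never sees its own result.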
Note that we couldn’t use the betting odds as features, even though the information there is invaluable (see this work which shows there is alpha by just averaging betting odds from different bookmakers). The reason is simple, as previously explained: we didn’t have a backfill for historical betting odds. We could only use the odds that were available after we started to collect them, which was only (barely) enough for backtesting.
Also, we had a trump card, which ended up being the most important feature: TrueSkill.
TrueSkill
TrueSkill is a Bayesian skill rating system developed by Microsoft for multiplayer games, similar to but arguably better than the Elo rating. It aims to estimate the “true skill” of each player or team based on their performance history.
TrueSkill uses a Gaussian distribution to represent the skill level of each player, and it updates these skill levels after each match using Bayesian2 updates. TrueSkill provides not just the estimated skill but also the uncertainty around each player’s skill, both of which can be used as features in an ML model.
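As an illustration of how a pair of Gaussian ratings can be turned into a win probability, here is a minimal sketch. It uses the approximation commonly paired with TrueSkill ratings for two sides, and the conventional default values for a new player; it is an assumption that this matches how our TrueSkill features were computed, and the function names are made up.

```python
import math

MU, SIGMA = 25.0, 25.0 / 3.0  # conventional TrueSkill defaults for a new player
BETA = SIGMA / 2.0            # conventional default for performance noise

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_probability(mu1, sigma1, mu2, sigma2, beta=BETA):
    """P(side 1 beats side 2) when each side is summarised by a single Gaussian."""
    denom = math.sqrt(2.0 * beta**2 + sigma1**2 + sigma2**2)
    return normal_cdf((mu1 - mu2) / denom)

p_even = win_probability(MU, SIGMA, MU, SIGMA)            # two unknown sides
p_fav = win_probability(30.0, 2.0, 25.0, 2.0)             # clear favourite, low uncertainty
p_fav_uncertain = win_probability(30.0, 8.0, 25.0, 8.0)   # same gap, high uncertainty
```

Note how the same skill gap produces a probability closer to 50% when the uncertainty is larger, which is why both the mean and the uncertainty are useful features.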
Inferential vs predictive models
There are two cultures in the use of statistical modeling to reach conclusions from data. -Leo Breiman3
If we have TrueSkill, which predicts the win probability between two teams, why do we even need an ML model? TrueSkill is an inferential model, which attempts to explain the world through latent variables. Of course, a perfect model of the world would also make great predictions but, in practice, there is always a trade-off between explainability and predictive power. That is the biggest tension between statistics and ML.
ML models are typically less interpretable black boxes but much more powerful at making predictions. They can incorporate a wide range of features, including but not limited to those provided by TrueSkill, with the sole focus of optimising a loss function.
Dataset
Here is the matches dataset with all the features and target together, including the TrueSkill features. I don’t show the actual feature engineering calculation for the sake of brevity, as this post is long enough as it is.
Modelling
XGBoost is all you need. -Bojan Tunguz
The modelling done here is pretty standard tabular ML with a couple of notable exceptions:
- We remove ties, which represent roughly 1.5% of the dataset
- We do data augmentation by swapping team1 and team2 features and adding both rows to the training set
- We can do that as there is no “home advantage” in CS:GO like there is in football
- When making predictions, we average the predictions across both scenarios
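The averaging step from the list above can be sketched in a few lines. This is a toy illustration with a hypothetical, deliberately asymmetric stand-in model (`toy_model`), not the real predictor shown later in the post.

```python
def symmetric_win_probability(predict_team1_win, team1_features, team2_features):
    """Average the model's view from both orderings so P(A beats B) + P(B beats A) = 1."""
    p_original = predict_team1_win(team1_features, team2_features)
    # With the teams swapped, the model outputs P(original team2 wins); flip it back.
    p_swapped = 1.0 - predict_team1_win(team2_features, team1_features)
    return (p_original + p_swapped) / 2.0

# Hypothetical stand-in for a trained model, with a built-in bias towards "team1".
def toy_model(team1_features, team2_features):
    raw = 0.5 + 0.1 * (team1_features["rating"] - team2_features["rating"]) + 0.05
    return min(max(raw, 0.0), 1.0)

a, b = {"rating": 1.0}, {"rating": 0.0}
p_ab = symmetric_win_probability(toy_model, a, b)
p_ba = symmetric_win_probability(toy_model, b, a)
```

The raw toy model gives 0.65 for A vs B and 0.45 for B vs A, which do not add up to 1; after averaging, the two orderings agree with each other.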
Out-of-time train-test split
We use out-of-time split instead of the more typical cross-validation. In pretty much any real-life application, a model is trained with past data and ends up used to predict future unseen data. Your evaluation should reflect that, as you might be interested to know how your model performance degrades over time (which could be caused, for example, by concept drift).
If you intend to re-train your model periodically, say, weekly, you could evaluate that training strategy by simply using a time-series cross validation, where you train the model with data up to week X and predict on data of week X+1.
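A weekly re-training evaluation could be sketched as an expanding-window split like the one below. This is only an illustration under simple assumptions: the function name is made up, and dates are plain `datetime.date` objects rather than a dataframe column.

```python
from datetime import date, timedelta

def weekly_expanding_splits(match_dates, first_test_week, n_weeks):
    """Yield (train_idx, test_idx): train on everything before week X, test on week X."""
    for k in range(n_weeks):
        start = first_test_week + timedelta(weeks=k)
        end = start + timedelta(weeks=1)
        train_idx = [i for i, d in enumerate(match_dates) if d < start]
        test_idx = [i for i, d in enumerate(match_dates) if start <= d < end]
        # Skip weeks with nothing to train on or nothing to predict.
        if train_idx and test_idx:
            yield train_idx, test_idx

# Toy data: one match per day for four weeks.
dates = [date(2019, 1, 1) + timedelta(days=i) for i in range(28)]
splits = list(weekly_expanding_splits(dates, date(2019, 1, 15), n_weeks=2))
```

Each fold only ever trains on matches that happened strictly before the week being predicted, which is the property a random cross-validation split does not give you.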
Code
import numpy as np
import pandas as pd

dataset = pd.read_parquet("dataset.parquet").drop_duplicates()
dt_train = '2019-01-01'
dt_test = '2019-08-01'
dataset['target'] = (dataset['winner'] == 'team1').astype(bool)
dataset = dataset[
    (dataset['match_date'] >= '2017-01-01') &
    (dataset['winner'] != 'tie') &
    (dataset['match_id'] != 'https://www.hltv.org/matches/2332976/lucid-dream-vs-alpha-red-esl-pro-league-season-9-asia')
].reset_index(drop=True)
mask_train = dataset['match_date'] < dt_train
dataset_train = dataset.loc[mask_train].reset_index(drop=True)
dataset_train2 = dataset_train.sample(frac=1).reset_index(drop=True)
dataset_train2['target'] = ~dataset_train2['target']
cols = []
# Swapping team features
for c in list(dataset_train.columns):
    if c.startswith('team1_'):
        cols.append(c.replace('team1_', 'team2_').replace('_team2', '_team1'))
    elif c.startswith('team2_'):
        cols.append(c.replace('team2_', 'team1_').replace('_team1', '_team2'))
    else:
        cols.append(c)
dataset_train2 = dataset_train2.rename(columns=dict(zip(dataset_train.columns, cols)))
dataset_train = dataset_train[cols]
dataset_train2 = dataset_train2[cols]
dataset_train = pd.concat([dataset_train, dataset_train2], axis=0, ignore_index=True).reset_index(drop=True)
idxs = np.random.choice(len(dataset_train), replace=False, size=4000)
dataset_val = dataset_train.loc[idxs].drop_duplicates('match_id').reset_index(drop=True)
dataset_val = dataset_val.reset_index(drop=True)
index = np.arange(len(dataset_train))
mask = ~np.isin(index, idxs)  # np.in1d is deprecated in recent NumPy versions
dataset_train = dataset_train.loc[mask].reset_index(drop=True)
mask_test = (
    (dataset['match_date'] >= dt_train) &
    (dataset['match_date'] < dt_test)
)
dataset_test = dataset.loc[mask_test].reset_index(drop=True)
dataset_train.shape, dataset_val.shape, dataset_test.shape
((38490, 267), (3802, 267), (4759, 267))
Code
import plotly.express as px

dataset['match_date'] = pd.to_datetime(dataset['match_date'])
# Create weekly match counts
match_counts = dataset.groupby(dataset['match_date'].dt.to_period('W')).size().reset_index(name='count')
match_counts['match_date'] = match_counts['match_date'].dt.to_timestamp()
# Define color for each period
match_counts['period'] = 'Train'
match_counts.loc[match_counts['match_date'] >= pd.to_datetime(dt_train), 'period'] = 'Test'
# For validation, we'll consider it as part of the train set but with a different color
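val_mask = pd.to_datetime(dataset_val['match_date']).dt.to_period('W').value_counts().reset_index()
val_mask.columns = ['match_date', 'val_count']
# Convert weekly periods to timestamps so the merge key has the same dtype as match_counts['match_date']
val_mask['match_date'] = val_mask['match_date'].dt.to_timestamp()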
match_counts = match_counts.merge(val_mask, on='match_date', how='left')
match_counts['val_count'] = match_counts['val_count'].fillna(0)
match_counts.loc[match_counts['val_count'] > 0, 'period'] = 'Validation'
# Create the plot
fig = px.line(match_counts, x='match_date', y='count', color='period',
title='Number of Matches Over Time (Weekly)',
labels={'count': 'Number of Matches', 'match_date': 'Date'},
color_discrete_map={'Train': 'blue', 'Validation': 'green', 'Test': 'red'})
# Add vertical lines for train/test split
fig.add_vline(x=dt_train, line_dash="dash", line_color="gray")
# Add annotation for the train/test split
fig.add_annotation(x=dt_train, y=1, yref="paper", showarrow=False,
text="Train/Test Split", textangle=-90, xanchor="right")
# Update layout for better readability
fig.update_layout(
legend_title_text='Dataset',
xaxis_title="Date",
yaxis_title="Number of Matches per Week",
)
fig.show()
Model: LightGBM
We use a standard off-the-shelf LightGBM binary classifier. There are many advantages to using LightGBM or XGBoost for tabular data problems (either choice is fine!):
- Handles missing values natively
- Handles categorical features natively
- Early stopping to optimize the number of estimators
- Blazing fast and scalable
- Multiple loss function options, including custom ones
- For binary classification, the default is the log loss (a proper scoring rule, which should lead to well-calibrated probabilities)
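To see what "proper scoring rule" means in practice, here is a small numerical check (a sketch with made-up numbers, not part of the modelling code): if an event truly happens 70% of the time, the expected log loss is smallest when you predict exactly 70%.

```python
import math

def expected_log_loss(true_p, predicted_p):
    """Expected binary log loss when the event truly happens with probability true_p."""
    return -(true_p * math.log(predicted_p) + (1 - true_p) * math.log(1 - predicted_p))

true_p = 0.7
candidates = [i / 100 for i in range(1, 100)]  # predictions from 0.01 to 0.99
best = min(candidates, key=lambda q: expected_log_loss(true_p, q))
```

In other words, a model trained on log loss has no incentive to exaggerate or understate its probabilities, which is what we need when comparing them with betting odds.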
For more information on how to unlock the power of LightGBM, watch my PyData London 2022 presentation.
from lightgbm import LGBMClassifier
from lightgbm.callback import early_stopping, log_evaluation

class CSGOPredictor(object):
    def __init__(self, model_params):
        self.model_params = model_params

    def fit(self, x_train, y_train, x_val, y_val):
        self.lgb = LGBMClassifier(**self.model_params)
        self.lgb.fit(
            x_train, y_train,
            eval_set=[(x_train, y_train), (x_val, y_val)],
            eval_names=['training', 'validation'],
            callbacks=[
                early_stopping(stopping_rounds=50),
                log_evaluation(period=25),  # Log every 25 iterations
            ]
        )
        return self

    def predict_proba(self, x):
        # Predictions are done twice and then averaged, with swapped team features
        original = self.lgb.predict_proba(x)
        x_inv = x.copy()
        team1_cols = [i for i in x_inv.columns if i.startswith('team1')]
        team2_cols = [i for i in x_inv.columns if i.startswith('team2')]
        x_inv = x_inv.rename(dict(zip(team1_cols + team2_cols, team2_cols + team1_cols)), axis=1)
        x_inv = x_inv.reindex(columns=x.columns)
        inv = self.lgb.predict_proba(x_inv)
        # Swap the class columns back so both arrays refer to the original team order
        inv[:, 0], inv[:, 1] = inv[:, 1], inv[:, 0].copy()
        return (original + inv) / 2.0

    def predict(self, x):
        return self.predict_proba(x).argmax(axis=1)
Code
drop_cols = ['winner', 'match_date', 'match_id', 'event_id', 'team1_id', 'team2_id', 'target']
x_train = dataset_train.drop(columns=drop_cols, axis=1)
y_train = dataset_train['target']
features = list(x_train.columns)
x_val = dataset_val[features]
y_val = dataset_val['target']
x_test = dataset_test[features]
y_test = dataset_test['target']
model_params = {
'n_estimators': 10_000,
'learning_rate': 0.05
}
model = CSGOPredictor(model_params).fit(x_train, y_train, x_val, y_val)
Training until validation scores don't improve for 50 rounds
[25] training's binary_logloss: 0.589321 validation's binary_logloss: 0.604284
[50] training's binary_logloss: 0.562941 validation's binary_logloss: 0.588552
[75] training's binary_logloss: 0.547649 validation's binary_logloss: 0.584521
[100] training's binary_logloss: 0.535844 validation's binary_logloss: 0.583577
[125] training's binary_logloss: 0.525761 validation's binary_logloss: 0.5834
[150] training's binary_logloss: 0.515957 validation's binary_logloss: 0.582886
[175] training's binary_logloss: 0.506858 validation's binary_logloss: 0.582916
[200] training's binary_logloss: 0.49788 validation's binary_logloss: 0.58255
[225] training's binary_logloss: 0.488956 validation's binary_logloss: 0.581818
[250] training's binary_logloss: 0.48075 validation's binary_logloss: 0.582284
[275] training's binary_logloss: 0.472855 validation's binary_logloss: 0.581823
[300] training's binary_logloss: 0.46497 validation's binary_logloss: 0.582013
Early stopping, best iteration is:
[271] training's binary_logloss: 0.474098 validation's binary_logloss: 0.581628
Feature importance
Here is the “beeswarm” view of SHAP values. It shows not just the importance but also how each feature relates to the prediction logits4:
import shap

explainer = shap.Explainer(model.lgb)
shap_values = explainer(x_test)
shap.plots.beeswarm(shap_values, max_display=20)
Unsurprisingly, the TrueSkill win probability features are the most important ones. In a sense, this can be seen as a form of stacking, since TrueSkill is another model. Other important features relate to the team’s past performance, like KD ratio and ADR.
Are ~250 features really necessary? Probably not, especially with just 30k samples5. We didn’t do any feature selection, but I’d do permutation importance and adversarial validation on a time split with more time on my hands6.
Evaluation
We evaluate using the following metrics:
- Accuracy: how many bets you expect to get right
- AUC7: how well you rank-order the winners/losers
- Brier score: a metric that takes both calibration and accuracy into account
I also plot the calibration curves for the training and test sets.
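To build some intuition for how the three metrics differ, here is a hand-rolled sketch on made-up predictions (the real evaluation uses scikit-learn). Both sets of predictions pick the right winner every time, so accuracy and AUC cannot tell them apart, but the Brier score rewards the more confident (and correct) one.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of correct calls when predicting a win above the threshold."""
    return sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """Share of (winner, loser) pairs ranked in the right order; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    pairs = [(pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg]
    return sum(pairs) / len(pairs)

y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]
hesitant = [0.6, 0.4, 0.6, 0.6]
```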
Code
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score

def calculate_metrics(X, y, model):
    y_pred_proba = model.predict_proba(X)[:, 1]
    y_pred = model.predict(X)
    return {
        'Accuracy': accuracy_score(y, y_pred),
        'AUC': roc_auc_score(y, y_pred_proba),
        'Brier_score': brier_score_loss(y, y_pred_proba)
    }
metrics_train = calculate_metrics(x_train, y_train, model)
metrics_val = calculate_metrics(x_val, y_val, model)
metrics_test = calculate_metrics(x_test, y_test, model)
metrics_df = pd.DataFrame([metrics_train, metrics_val, metrics_test],
                          index=['Training', 'Validation', 'Test'])
metrics_df
| | Accuracy | AUC | Brier_score |
|---|---|---|---|
| Training | 0.776955 | 0.867345 | 0.155439 |
| Validation | 0.734087 | 0.816379 | 0.176430 |
| Test | 0.704980 | 0.771550 | 0.191545 |
Code
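import plotly.graph_objects as go
from sklearn.calibration import calibration_curve

def plot_calibration_curve(y_true, y_pred_proba, set_name, fig, color):
    # calibration_curve returns (fraction of positives, mean predicted value), in that order
    fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_pred_proba, n_bins=10)
    fig.add_trace(go.Scatter(
        x=mean_predicted_value, y=fraction_of_positives,
        mode='lines+markers', name=f'{set_name} set',
        line=dict(color=color)
    ))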
# Create a new figure for the calibration plot
calibration_fig = go.Figure()
# Add the perfectly calibrated line
calibration_fig.add_trace(go.Scatter(
x=[0, 1], y=[0, 1],
mode='lines', name='Perfectly calibrated',
line=dict(dash='dot')
))
# Plot calibration curve for the training set
plot_calibration_curve(y_train, model.predict_proba(x_train)[:, 1], 'Training', calibration_fig, 'blue')
# Plot calibration curve for the test set
plot_calibration_curve(y_test, model.predict_proba(x_test)[:, 1], 'Test', calibration_fig, 'red')
# Set layout properties for the calibration plot
calibration_fig.update_layout(
title="Calibration plot",
xaxis_title="Mean predicted value",
yaxis_title="Fraction of positives",
xaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
yaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
showlegend=True
)
calibration_fig.show()
The model seems well calibrated, which makes it useful for betting: recall from the previous post that our betting decision rule is based on the probability of team 1 or 2 winning.
If the model weren’t well calibrated, we could fit an isotonic regression on a validation set to fix calibration issues. There are other options for post-hoc model calibration, like Platt scaling, but isotonic regression tends to work well for tree-based models when there is enough validation data.
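For the curious, the idea behind isotonic calibration can be sketched with the pool-adjacent-violators algorithm in plain Python. This is a simplified step-function variant for illustration only (in practice you would reach for a library implementation), and the scores and outcomes below are made up.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators: non-decreasing step function from score to outcome rate."""
    pairs = sorted(zip(scores, outcomes))
    # Each block holds [sum of outcomes, count, largest score in the block].
    blocks = []
    for s, y in pairs:
        blocks.append([float(y), 1, s])
        # Merge backwards while the block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def isotonic_predict(steps, score):
    """Return the calibrated value of the first block whose upper edge covers the score."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]

# Made-up validation scores (raw model probabilities) and observed outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
steps = isotonic_fit(scores, outcomes)
calibrated = isotonic_predict(steps, 0.45)
```

The fitted mapping is forced to be non-decreasing, so the model's ranking of matches is preserved while the probabilities themselves are pulled towards the observed win rates.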
Code
def auc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])
    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)
    # Initialize a dictionary to store AUC for each week
    weekly_auc = {}
    for week_start_date, group in weekly_df.groupby('week_start_date'):
        # AUC is undefined when a week contains a single class, so skip those weeks
        if group[target_col].nunique() > 1:
            X = group[features]
            y = group[target_col]
            auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
            weekly_auc[week_start_date] = auc
    return pd.Series(weekly_auc)

def acc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])
    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)
    # Initialize a dictionary to store accuracy for each week
    weekly_acc = {}
    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            acc = accuracy_score(y, model.predict(X))
            weekly_acc[week_start_date] = acc
    return pd.Series(weekly_acc)
Code
# Calculate weekly AUC for training and test sets
weekly_auc_train = auc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_auc_test = auc_over_time(dataset_test, model, 'match_date', 'target', features)
# Plotting the AUC over time using Plotly
trace0 = go.Scatter(
x=weekly_auc_train.index,
y=weekly_auc_train.values,
mode='lines+markers',
name='Training Set',
line=dict(color='blue')
)
trace1 = go.Scatter(
x=weekly_auc_test.index,
y=weekly_auc_test.values,
mode='lines+markers',
name='Test Set',
line=dict(color='red')
)
layout = go.Layout(
title='AUC Over Time',
xaxis=dict(title='Week Start Date'),
yaxis=dict(title='AUC'),
showlegend=True
)
fig = go.Figure(data=[trace0, trace1], layout=layout)
fig.add_hline(y=0.5, line_dash="dash", line_color="black",
annotation_text="Random prediction", annotation_position="bottom right")
avg_train_auc = weekly_auc_train.mean()
avg_test_auc = weekly_auc_test.mean()
# Training set average line for the training period
fig.add_shape(type='line',
x0=weekly_auc_train.index.min(), y0=avg_train_auc,
x1=weekly_auc_train.index.max(), y1=avg_train_auc,
line=dict(dash='dash', color='blue', width=2),
xref='x', yref='y')
# Test set average line for the test period
fig.add_shape(type='line',
x0=weekly_auc_test.index.min(), y0=avg_test_auc,
x1=weekly_auc_test.index.max(), y1=avg_test_auc,
line=dict(dash='dash', color='red', width=2),
xref='x', yref='y')
# Add annotations for the averages
fig.add_annotation(x=weekly_auc_train.index.max(), y=avg_train_auc,
text=f"Train Avg: {avg_train_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_auc_test.index.max(), y=avg_test_auc,
text=f"Test Avg: {avg_test_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.show()
There is a train-test performance gap, which implies overfitting, but that’s not a big concern per se. What we really care about is that the out-of-time performance is good enough, which will be evaluated with the backtest below. Overfitting is normal with gradient-boosted tree models, but their generalization performance is still typically better than that of other models like logistic regression or random forests (I will leave model comparison as an exercise to the reader).
Also, note that there is a big drop in the last 3 weeks of the test dataset, which may indicate some kind of drift. That suggests we should not let the model go for more than 6 months without re-training.
Code
# Calculate weekly accuracy for training and test sets
weekly_acc_train = acc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_acc_test = acc_over_time(dataset_test, model, 'match_date', 'target', features)
# Plotting the accuracy over time using Plotly
trace0 = go.Scatter(
x=weekly_acc_train.index,
y=weekly_acc_train.values,
mode='lines+markers',
name='Training Set',
line=dict(color='blue')
)
trace1 = go.Scatter(
x=weekly_acc_test.index,
y=weekly_acc_test.values,
mode='lines+markers',
name='Test Set',
line=dict(color='red')
)
layout = go.Layout(
title='Accuracy Over Time',
xaxis=dict(title='Week Start Date'),
yaxis=dict(title='Accuracy'),
showlegend=True
)
fig = go.Figure(data=[trace0, trace1], layout=layout)
fig.add_hline(y=0.5, line_dash="dash", line_color="black",
annotation_text="Random prediction", annotation_position="bottom right")
avg_train_acc = weekly_acc_train.mean()
avg_test_acc = weekly_acc_test.mean()
# Training set average line for the training period
fig.add_shape(type='line',
x0=weekly_acc_train.index.min(), y0=avg_train_acc,
x1=weekly_acc_train.index.max(), y1=avg_train_acc,
line=dict(dash='dash', color='blue', width=2),
xref='x', yref='y')
# Test set average line for the test period
fig.add_shape(type='line',
x0=weekly_acc_test.index.min(), y0=avg_test_acc,
x1=weekly_acc_test.index.max(), y1=avg_test_acc,
line=dict(dash='dash', color='red', width=2),
xref='x', yref='y')
# Add annotations for the averages
fig.add_annotation(x=weekly_acc_train.index.max(), y=avg_train_acc,
text=f"Train Avg: {avg_train_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_acc_test.index.max(), y=avg_test_acc,
text=f"Test Avg: {avg_test_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.show()
The accuracy plot is similar to the AUC plot in all aspects. Note that we’re much better than predicting at random, but that is not a good baseline here. A much better baseline would be the accuracy calculated with the probabilities implied by the betting odds.
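One way such an odds-based baseline could be computed is sketched below. The rows are made up, the function names are hypothetical, and rescaling the implied probabilities so they sum to 1 (removing the bookmaker margin) is my assumption; the backtest later in this post uses the raw 1/odds instead.

```python
def implied_probabilities(team1_odds, team2_odds):
    """Convert decimal odds to probabilities, rescaled to remove the bookmaker margin."""
    raw1, raw2 = 1.0 / team1_odds, 1.0 / team2_odds
    overround = raw1 + raw2  # greater than 1 when the bookmaker takes a margin
    return raw1 / overround, raw2 / overround

def bookmaker_accuracy(rows):
    """Share of matches where the bookmaker's favourite actually won."""
    hits = 0
    for team1_odds, team2_odds, winner in rows:
        p1, _ = implied_probabilities(team1_odds, team2_odds)
        favourite = "team1" if p1 >= 0.5 else "team2"
        hits += favourite == winner
    return hits / len(rows)

# Made-up rows: (team1 decimal odds, team2 decimal odds, actual winner).
rows = [
    (1.50, 2.50, "team1"),
    (1.80, 2.00, "team2"),
    (3.00, 1.35, "team2"),
    (1.20, 4.50, "team1"),
]
baseline = bookmaker_accuracy(rows)
```

If the model's accuracy on the same matches is not clearly above this number, the model is mostly rediscovering what the bookmakers already know.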
Backtesting
Past performance is no guarantee of future results.
Backtesting is replaying the past with your model decisions. One example of backtesting is the following:
1. Train model with data up to a certain date
2. Sample betting odds for the next matches
3. Make bets for those next matches according to your betting strategy
4. Repeat 1-3 until you cover all the test data
5. Evaluate ML metrics (e.g. AUC) and business metrics (e.g. ROI) on your bets
Backtesting allows us to assess our financial performance, which matters a lot more than ML metrics. For example, is an AUC of 0.77 good or bad? That is hard to tell in general, while an ROI of 1.1 is something we can understand and compare to other strategies (including leaving your money in the bank to earn risk-free interest).
Here, we only assess the ROI of the bets, not other financial metrics like the Sharpe ratio or max drawdown.
For simplicity, we just train the model once and keep it fixed for all future bets, which makes it a more conservative backtest.
First, let’s download the dataset with matches with betting odds:
Code
dataset_with_odds = pd.read_parquet("match_predictions_with_odds.parquet")
dataset_with_odds = dataset_with_odds[["match_id", "team1_odds", "team2_odds"]]
dataset_with_odds = dataset_with_odds.merge(dataset_test, on="match_id")
dataset_with_odds['match_date'] = pd.to_datetime(dataset_with_odds['match_date'])
dataset_with_odds = dataset_with_odds.sort_values(by='match_date')
dataset_with_odds.shape
(1113837, 269)
Now, let’s simulate our betting strategy:
- For each match, sample just one set of betting odds at random
- Only bet if winning probability is over 50% AND
- Only bet if the probability of winning is greater than the implied probability by the odds plus a delta of 1%
- The bet can either be a fixed amount or determined by the Kelly criterion (here, for simplicity, I only show fixed betting – see previous blog post for a discussion on the Kelly criterion and some variants)
The first premise sounds odd: shouldn’t we pick the best possible betting odds? Not really, for two real-life reasons: (1) for risk management, you don’t want to bet multiple times on the same match; (2) you might not be able to bet when you want, for multiple reasons (e.g. you are asleep).
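The decision rule from the list above, together with the Kelly fraction for decimal odds, can be sketched as follows. The function names are made up for illustration; the default thresholds mirror the 50% and 1% values described above, and the simulation in this post uses fixed stakes rather than Kelly sizing.

```python
def should_bet(model_proba, decimal_odds, min_proba=0.5, min_delta=0.01):
    """Bet only on likely winners whose model probability beats the implied one by a margin."""
    implied = 1.0 / decimal_odds
    return model_proba > min_proba and model_proba > implied + min_delta

def expected_profit_per_unit(model_proba, decimal_odds):
    """Expected profit of a 1-unit stake, if the model probability is right."""
    return model_proba * decimal_odds - 1.0

def kelly_fraction(model_proba, decimal_odds):
    """Fraction of bankroll suggested by the Kelly criterion (0 when there is no edge)."""
    b = decimal_odds - 1.0  # net winnings per unit staked
    return max((model_proba * decimal_odds - 1.0) / b, 0.0)

# Model says 60%, bookmaker offers 1.80 (implied probability of about 55.6%).
bet = should_bet(0.60, 1.80)
edge = expected_profit_per_unit(0.60, 1.80)
stake_fraction = kelly_fraction(0.60, 1.80)
```

With a model probability of 55% at the same odds, the margin over the implied probability is below 1%, so the rule would skip the bet.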
There was some trial and error involved in designing our betting strategy and I’m sure there is room for improvement. The delta of 1% is our safety margin against model error, and we found it with a grid search. It’s a parameter you can play with in the simulation below:
MIN_PROBA = 0.5
MIN_DELTA_PROBA = 0.01
N_SIMS = 200
all_samples_data = []
for _ in range(N_SIMS):
    # Sample one odds row per match
    df = dataset_with_odds.groupby('match_id').sample(n=1).reset_index(drop=True)
    predict_proba = model.predict_proba(df[features])
    df['team1_proba'] = predict_proba[:, 1]
    df['team2_proba'] = predict_proba[:, 0]
    df["team1_implied_prob"] = 1 / df["team1_odds"]
    df["team2_implied_prob"] = 1 / df["team2_odds"]
    df["team1_bet"] = (df.team1_proba > MIN_PROBA) & (df.team1_proba > (df.team1_implied_prob + MIN_DELTA_PROBA))
    df["team2_bet"] = (df.team2_proba > MIN_PROBA) & (df.team2_proba > (df.team2_implied_prob + MIN_DELTA_PROBA))
    # A winning 1-unit bet pays back the decimal odds; a losing bet pays nothing
    df["team1_returns"] = np.where(df.team1_bet & (df.winner=='team1'), df["team1_odds"], 0.0)
    df["team2_returns"] = np.where(df.team2_bet & (df.winner=='team2'), df["team2_odds"], 0.0)
    # 'loss' is the amount staked: 1 unit per bet placed
    df["loss"] = df["team1_bet"].astype(int) + df["team2_bet"].astype(int)
    df["revenue"] = df["team1_returns"] + df["team2_returns"]
    df["profit"] = df["revenue"] - df["loss"]
    all_samples_data.append(df)
Code
all_samples_df = pd.concat(all_samples_data).reset_index(drop=True)
all_samples_df['match_date'] = pd.to_datetime(all_samples_df['match_date'])
all_samples_df.sort_values(by='match_date', inplace=True)
all_samples_df['cumulative_profit'] = all_samples_df.groupby('match_date')['profit'].cumsum()
daily_profit_sum = all_samples_df.groupby('match_date')['profit'].sum().reset_index()
daily_profit_sum['cumulative_profit'] = daily_profit_sum['profit'].cumsum()/N_SIMS
total_profits = all_samples_df['profit'].sum()
total_bets = all_samples_df['loss'].sum() # This assumes that 'loss' is the number of bets in the all_samples_df
roi = total_profits / total_bets if total_bets > 0 else 0
# Calculate the annualized ROI
min_date = all_samples_df['match_date'].min()
max_date = all_samples_df['match_date'].max()
duration_years = (max_date - min_date) / pd.Timedelta(days=365.25)
annualized_roi = (roi + 1) ** (1 / duration_years) - 1 if duration_years > 0 else 0
print(f"Backtest ROI: {round(roi*100)}%")
print(f"Annualized ROI: {round(annualized_roi*100)}%")
Backtest ROI: 10%
Annualized ROI: 63%
The ROI after 2 months is 10%, which annualized would be 63%, not bad at all! For reference, the risk-free interest rate in the US today is around 5% per year, while the average S&P500 returns are roughly 10% a year.
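The annualization is just compounding. Here is the arithmetic as a small sketch; the 71-day duration is an assumed value chosen for illustration (the real span comes from the first and last match dates in the data), and the function name is made up.

```python
def annualize(roi, days):
    """Compound a period return to a yearly rate: (1 + roi) ** (365.25 / days) - 1."""
    return (1.0 + roi) ** (365.25 / days) - 1.0

# 71 days is an assumption for illustration, roughly "a bit over 2 months".
annualized = annualize(0.10, days=71)
same_period = annualize(0.10, days=365.25)  # a full year leaves the ROI unchanged
```

Keep in mind that this treats the 10% as a return that keeps compounding at the same pace for a whole year, which is an optimistic reading of a short backtest.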
We did have an edge after all, or so it seemed. Let’s see the uncertainty across multiple simulations:
Code
# Create a Plotly figure
fig = go.Figure()
# Add traces for each sample's cumulative profits
for sample_data in all_samples_data:
    # Make sure to sort the sample_data by 'match_date'
    sample_data_sorted = sample_data.sort_values(by='match_date')
    fig.add_trace(go.Scatter(
        x=sample_data_sorted['match_date'],
        y=sample_data_sorted['profit'].cumsum(),
        mode='lines',
        line=dict(width=1, color='lightgrey'),
        showlegend=False
    ))
# Add a trace for the average cumulative profits per date
fig.add_trace(go.Scatter(
x=daily_profit_sum['match_date'],
y=daily_profit_sum['cumulative_profit'],
mode='lines',
name='Avg Cum. Profits',
line=dict(width=3, color='blue')
))
# Adding ROI text
fig.add_trace(go.Scatter(
x=[daily_profit_sum['match_date'].iloc[-1] + pd.DateOffset(days=4)],
y=[daily_profit_sum['cumulative_profit'].iloc[-1]],
text=[f"ROI: {roi:.2f}"], # The ROI text
mode="text",
showlegend=False,
textfont=dict( # Adjust the font properties here
size=14,
color='black',
)
))
# Update layout to add titles and make it more informative
fig.update_layout(
title="Cumulative Profits over Time with Average",
xaxis_title="Match Date",
yaxis_title="Cumulative Profit",
legend_title="Legend",
template="plotly_white",
xaxis=dict(
type='date' # Ensure that x-axis is treated as date
)
)
# Show the figure
fig.show()